Unit 2 Module 2b

Spatial Data Analysis with R (01:450:320)

1 Algebra, categorizing, and sampling

Before starting, let us define a function to load packages quietly. Since we will be using several different libraries in this module, this approach ensures our workspace remains clean and prevents installation logs or startup messages from cluttering the output.

# Define a quiet loader function
load_quietly <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Install without outputting progress bars or logs
    install.packages(pkg, dependencies = TRUE, quiet = TRUE, 
                     repos = "http://cran.us.r-project.org")
  }
  # Load the library while suppressing startup messages and warnings
  suppressMessages(suppressWarnings(library(pkg, character.only = TRUE)))
}

Let’s move on and look at some additional raster-based analyses. The next few sections depend on objects created in the previous sections (Unit 2 Module 2a). The code to recreate those is here:

# Load sdadata
load_quietly("sdadata")

bioclim <- rast(system.file("extdata/bioclim.tif", package = "sdadata"))
districts <- system.file("extdata/districts.geojson", package = "sdadata") %>% 
    st_read %>% mutate(ID = 1:nrow(.))
#> Reading layer `districts' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/districts.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 170 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 29.58953 ymin: -11.76235 xmax: 40.44473 ymax: -0.983143
#> Geodetic CRS:  WGS 84
bioclimz <- mask(x = bioclim, mask = districts)

tzar <- rast(x = ext(districts), crs = crs(districts), res = 0.1)
values(tzar) <- 1:ncell(tzar)
tzar2 <- rast(x = ext(districts), crs = crs(districts), res = 0.25)
values(tzar2) <- 1:ncell(tzar2)

distsr <- rasterize(x = districts, y = tzar, field = "ID")
distsr_rs <- resample(x = distsr, y = bioclimz, method = "near")  # match extent

species <- system.file("extdata/species.csv", package = "sdadata") %>% read_csv
#> Rows: 6951 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): family, species
#> dbl (4): month, year, x, y
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
speciesr <- species %>% 
  dplyr::select(x, y) %>%
  mutate(count = 1) %>% 
  st_as_sf(coords = c("x", "y"), crs = 4326) %>% 
  rasterize(x = ., y = tzar2, field = "count", fun = sum) %>% 
  print()
#> class       : SpatRaster 
#> size        : 43, 43, 1  (nrow, ncol, nlyr)
#> resolution  : 0.2524466, 0.2506792  (x, y)
#> extent      : 29.58953, 40.44473, -11.76235, -0.983143  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source(s)   : memory
#> name        :  sum 
#> min value   :    1 
#> max value   : 1216
roads <- st_read(system.file("extdata/roads.geojson", package = "sdadata"))
#> Reading layer `roads' from data source 
#>   `/Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library/sdadata/extdata/roads.geojson' 
#>   using driver `GeoJSON'
#> Simple feature collection with 4569 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 478749.8 ymin: -1385619 xmax: 1591588 ymax: -118296.2
#> Projected CRS: Africa_Albers_Equal_Area_Conic

# And don't forget
plot_noaxes <- function(x, axes = FALSE, box = FALSE, mar = c(1, 0.5, 2, 4),
                        ...) {
  if(!class(x) %in%
     c("RasterLayer", "RasterStack", "RasterBrick", "SpatRaster")) {
    stop("This function is intended for rasters only", call. = FALSE)
  }
  par(mar = mar)
  if(class(x) %in% c("RasterLayer", "RasterStack", "RasterBrick")) {
    plot(x, axes = axes, box = axes, ...)
  } else {
    plot(x, axes = axes, mar = NA, ...)
  }
}

1.1 Algebra and selection

We can do any number of raster calculations using basic mathematical expressions. Here are some very basic ones:

# Chunk 36
r1 <- bioclimz[[1]] / bioclimz[[6]]
r2 <- bioclimz[[1]] + bioclimz[[6]]
r3 <- bioclimz[[1]] - bioclimz[[6]]
r4 <- bioclimz[[1]] * bioclimz[[6]]

s <- c(r1, r2, r3, r4)
names(s) <- c("divide", "sum", "subtract", "multiply")
plot_noaxes(s)

As you have seen in the previous section, we can also perform logical operations on rasters, selecting cells that meet certain criteria.

# Chunk 37
# # 1. Standardize functions and layers
# bio1 = Annual Mean Temp, bio12 = Annual Precip, bio15 = Precip Seasonality
temp <- bioclimz[["bio01"]]
precip <- bioclimz[["bio12"]]
precip_cv <- bioclimz[["bio15"]]

# #2. Identify Extremes: High Heat and Low Rainfall
# (Example threshold: Temp > 25°C and Rain < 600mm)
hot_and_dry <- temp > 25 & precip < 600

# #3. Regional Focus (District 145)
# Masking Annual Rainfall to a specific district
dist145 <- distsr_rs == 145
dist_precip <- dist145 * precip
dist_precip[dist145 == 0] <- NA

# #4. Regional Threshold: Above Average Rainfall within District 145
mu_dist <- unlist(global(dist_precip, mean, na.rm = TRUE))
dist_precip_high <- dist_precip > mu_dist

# #5. National Threshold: High Seasonality
# Areas where precipitation seasonality (CV) is higher than the national mean
mu_cv <- unlist(global(precip_cv, mean, na.rm = TRUE))
high_seasonality <- precip_cv > mu_cv

# Combine for plotting
s <- c(temp, precip, hot_and_dry, dist_precip, dist_precip_high, high_seasonality)
titles <- c("Mean Temp (BIO1)", "Annual Rain (BIO12)", "Extreme: Hot & Dry", 
            "Dist 145 Rain", "Dist 145 > Local Mean", "High Seasonality (BIO15)")
plot_noaxes(s, main = titles)

In #1, we extract the annual mean temperature, annual precipitation, and precipitation seasonality (CV) directly from the bioclimz dataset using their layer names. Then, in #2, we identify “climate stress” zones: cells that have both high annual mean temperatures (>25°C) and low annual rainfall (<600 mm).

In #3, we examine the annual precipitation within district 145. This is done in several steps. First, we create a mask (dist145), where cells in district 145 are 1, and everywhere else 0. Next, we multiply this mask by our precip layer (BIO12), thereby masking out rainfall values outside the district. At this point, if we calculated any subsequent statistics on district 145 precipitation, they would be invalid because there would be a lot of 0 values for rainfall outside of district 145 included in the calculation (run plot(dist_precip) right after dist_precip <- dist145 * precip to see). This would result in an artificially low mean precipitation, for example. Therefore, in line c, we identify the indices of the cells in dist54 that have 0s (dist54 == 0), and use those indices to set all corresponding values in dist_precip to NA.

Having made dist_precip safe for calculating statistics, in #4 we then identify the areas in district 145 that received more annual rainfall than the district mean (extracted using global, which returns a data.frame with a single value, which we get by unlisting). In #5, we do the same thing for the entire country, identifying areas across all of Tanzania where the annual rainfall is higher than the national average.

1.2 Categorizing

Sometimes you will want to create a categorical raster from a continous one. We will look at two different ways of doing that.

Method 1:

# Chunk 38
qtiles <- global(precip,
                 function(x) quantile(x, probs = seq(0, 1, 1/3), na.rm = TRUE))
raintotcut <- classify(x = precip, rcl = unlist(qtiles), include.lowest = TRUE)

cols <- c("tan", "yellow3", "green4")
plot_noaxes(raintotcut, legend = FALSE, main = "Total Rainfall", col = cols, 
            mar = c(0, 0, 2, 0))
legend(x = "bottomright", legend = c("High", "Intermediate", "Low"), 
       pch = 15, pt.cex = 3, col = rev(cols), bty = "n")

This approach using so-called cuts, perhaps the easiest way of categorizing a continuous raster. This use the function quantile, passed through an anonymous function in the app function, to get the tercile values (i.e 33.3rd, 66.6th, and 100th percentiles) for raintot. The resulting data.frame is vectorized, and these become cuts passed into the rcl argument of terra::classify. We pass the argument include.lowest = TRUE because we want to include all values that are greater than or equal to the lowest value in each category. Our plot of the results has a customized legend, which we chose because we have a categorical raster and thus wanted a categorical, rather than continuous (the default for terra::plot, and thus plot_noaxes), legend.

Method 2 is somewhat more involved:

# Chunk 39
rclmat <- cbind(unlist(qtiles[1:3]), unlist(qtiles[2:4]), 1:3)
rclmat
#>            [,1] [,2] [,3]
#> X0.         324  818    1
#> X33.33333.  818 1022    2
#> X66.66667. 1022 6224    3

raintotrcl <- classify(x = precip, rcl = rclmat, include.lowest = TRUE)

plot_noaxes(raintotrcl, legend = FALSE, main = "Total Rainfall", col = cols, 
            mar = c(0, 0, 2, 0))
legend(x = "bottomright", legend = c("High", "Intermediate", "Low"), 
       pch = 15, pt.cex = 3, col = rev(cols), bty = "n")

In the first step, we reuse the previously calculated terciles to create a reclassification matrix (rclmat). This matrix contains the lower bound of the range of the category in column 1, the upper bound in column 2, and then the new class value that will be assigned in column 3. There are three rows, one per category. This matrix is passed to the rcl argument of classify, in which we also specify the same include.lowest = TRUE argument.

This approach requires slightly more work, but there are cases where you might need to use it, e.g. to reclassify another categorical raster using non-sequential categories.

1.3 Sampling

Many analyses use data sampled from a raster. For example, you might have coordinates for places where you collected field data, and need to collect rasterized environmental data for those locations. The extract function is great for this.

1.3.1 Extract

From the description of extract, the function’s purpose is to:

Extract values from a Raster* object at the locations of other spatial data. You can use coordinates (points), lines, polygons or an Extent (rectangle) object. You can also use cell numbers to extract values.

Here we will use points and polygons:

# Chunk 40
# #1
species_env <- species %>% 
  st_as_sf(coords = c("x", "y")) %>% 
  mutate(rain = terra::extract(x = precip, y = .)[, 2]) %>%
  mutate(district = terra::extract(x = distsr_rs, y = .)[, 2]) %>% 
  print
#> Simple feature collection with 6951 features and 6 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 29.91544 ymin: -11.09595 xmax: 38.79323 ymax: -1.44546
#> CRS:           NA
#> # A tibble: 6,951 × 7
#>    family       species            month  year             geometry  rain district
#>  * <chr>        <chr>              <dbl> <dbl>              <POINT> <dbl>    <dbl>
#>  1 Elephantidae Loxodonta africana     6  2012 (34.08678 -2.276964)   787       74
#>  2 Bovidae      Syncerus caffer        6  2012   (35.75882 -3.5614)   666        3
#>  3 Bovidae      Syncerus caffer        6  2012 (35.44965 -3.230564)  2296        7
#>  4 Felidae      Panthera pardus        6  2012 (34.96447 -2.475424)   736       79
#>  5 Elephantidae Loxodonta africana     6  2012 (35.13262 -1.883995)   881       79
#>  6 Bovidae      Syncerus caffer        6  2012   (35.73392 -3.0035)  1904        7
#>  7 Elephantidae Loxodonta africana     6  2012 (35.61116 -3.004654)  2499        7
#>  8 Elephantidae Loxodonta africana     6  2012 (35.97219 -3.429279)   603        6
#>  9 Bovidae      Syncerus caffer        6  2012 (35.12084 -1.948731)   839       79
#> 10 Elephantidae Loxodonta africana     6  2012 (35.03655 -1.983431)   804       79
#> # ℹ 6,941 more rows

# #2
speciesdistid <- species_env %>% 
  drop_na %>%
  distinct(district) %>%
  pull

distrain <- districts %>% 
  filter(ID %in% speciesdistid) %>% 
  terra::extract(x = precip, y = ., fun = mean) %>% 
  rename(rain = bio12) %>% 
  print
#>    ID      rain
#> 1   1  962.6237
#> 2   2  816.4723
#> 3   3  803.7623
#> 4   4 1016.7394
#> 5   5  759.0712
#> 6   6  965.1684
#> 7   7  653.8991
#> 8   8  583.0270
#> 9   9  653.1400
#> 10 10  688.9380
#> 11 11  709.8703
#> 12 12  629.6706
#> 13 13  617.5856
#> 14 14 1224.4511
#> 15 15 1226.0498
#> 16 16  914.4493
#> 17 17 1094.9187
#> 18 18 1161.9180
#> 19 19 1149.5518
#> 20 20  677.3394
#> 21 21 1935.3883
#> 22 22  727.4703
#> 23 23 1834.6410
#> 24 24  862.7047
#> 25 25  855.8328
#> 26 26  880.3599
#> 27 27  615.7559
#> 28 28  999.1724
#> 29 29  668.0857
#> 30 30  941.8872
#> 31 31 1001.9575
#> 32 32  811.1277
#> 33 33 1062.7733
#> 34 34  821.4642
#> 35 35  771.2630
#> 36 36 1277.9204
#> 37 37 1010.5161
#> 38 38  964.8502
#> 39 39  993.9773
#> 40 40 1343.6949
#> 41 41 1072.8763
#> 42 42 1042.8343
#> 43 43  853.1183
#> 44 44  911.4511
#> 45 45  832.0877
#> 46 46  771.9737
#> 47 47  696.8761
#> 48 48  678.1867
#> 49 49 1031.8238
#> 50 50 1061.6095
#> 51 51  844.5027
#> 52 52 1113.3399
#> 53 53 1070.4145
#> 54 54 1362.6541

# #3
rain_stats <- bind_rows(
  species_env %>% 
    drop_na %>% 
    dplyr::select(rain) %>% 
    st_drop_geometry() %>% 
    mutate(dat = "Species"), 
  distrain %>%
    dplyr::select(rain) %>% 
    mutate(dat = "Districts")
)

# #4
bp_theme <- theme(legend.title = element_blank(), 
                  axis.text.x = element_blank(),
                  axis.ticks.x = element_blank(), 
                  panel.grid.major.x = element_blank(), 
                  panel.grid.minor.x = element_blank(), 
                  panel.background = element_rect(fill = "grey95"))
ggplot(rain_stats) +
  geom_boxplot(mapping = aes(y = rain, fill = dat), position = "dodge2") +
  scale_fill_manual(values = c("steelblue", "cadetblue")) + 
  ggtitle("Rainfall distributions") + xlab(NULL) + ylab("mm") + 
  bp_theme

In #1, we use a pipeline to create a new species_env sf, and we pass that to the “y” argument of terra::extract. The first extract is done from the precip raster, the second from the dists_rs raster. Both extractions are wrapped in mutate, in order to create two new variables for species_env. Note that we have to specify terra::extract because it is masked by dplyr::extract, as sdadata loads tidyverse after raster (remember section 1.2.2 in Unit 1 Module 2). The resulting sf contains rainfall sum and district numbers for each farmer’s location.

In #2, we first pull out of species_env a vector of the distinct district IDs. Because species contains a few bad coordinates that fall outside of Zambia, and there had NA values, we inserted drop_na into the pipeline to get rid of those. We then used speciesdistid to filter out the districts the farmers live in from districts, and used the result to extract from precip the mean rainfall from the pixels of precip falling within each selected district. When using terra::extract with polygons, if you don’t pass a value to the “fun” argument, a list is returned, in which each element contains the extracted pixel values corresponding to a particular polyon. If you give “fun” an argument (e.g. mean, as in this case), you get back a data.frame containing the resulting statistic(s) for each polygon.

In #3, we do some gymnastics in preparation for setting up ggplot boxplots (#4) that compare the distributions of rainfall values. That is, we set up a tibble that binds by row the extracted district and farmer rainfall values, which allows ggplot to automatically create side-by-side boxplots using fill = dat. We create some theme options for ggplot ahead of that (bp_theme), so that we can recycle these in the next example. The plot shows that the rainfall data within species locations provide a biased estimate of district mean rainfall.

1.3.2 Random sampling

Alternatively, you might want to draw a random sample directly from a raster, without all the trials and tribulations of field work (field logistics often make it hard to capture truly randomized, and thus representative, samples). raster has functions for this, and we will build off the previous example to demonstrate them.

# Chunk 41
# #1
speciesdistsr <- distsr_rs %in% speciesdistid
distsrspecies <- mask(x = distsr_rs, mask = speciesdistsr, maskvalue = 0)
# plot_noaxes(distsrspecies)

# #2
set.seed(1)
distsamp <- spatSample(x = distsrspecies, size = nrow(species_env), 
                       cells = TRUE, na.rm = TRUE)
randrain <- precip[distsamp[, 1]]

# #3
set.seed(1)
distsamp_str <- spatSample(x = distsrspecies, method = "stratified", 
                           size = nrow(species_env) / length(speciesdistid),
                           cells = TRUE)
stratrandrain <- precip[distsamp_str[, 1]]

# #4
rand_rain_stats <- bind_rows(
  tibble(distrain %>% dplyr::select(rain), dat = "Districts"),
  tibble(randrain %>% rename(rain = bio12), dat = "Simple"),
  tibble(stratrandrain %>% rename(rain = bio12), dat = "Stratified")
) %>% drop_na

ggplot(rand_rain_stats) +
  geom_boxplot(mapping = aes(y = rain, fill = dat), position = "dodge2") +
  scale_fill_manual(values = c("lightblue", "steelblue", "cadetblue")) + 
  ggtitle("Rainfall distributions") + xlab(NULL) + ylab("mm") + 
  bp_theme

In this case, we first (#1) mask the raster to exclude districts having no species (you can run the commented-out code to see the map). #2 uses the function terra::spatSample to take a straight random sample from the species-hosting districts. We set the “size” argument so that it is equal to nrow(species_env), i.e. our sample is equal to the number of species in species_env (793). We use the argument cells = TRUE to select the cell numbers, which is what we are really interested in (if we didn’t, we would just get the values of those cells, i.e. the district IDs). We then use those cell ids (the first column of distsamp) to extract the rainfall values from precip. We can use the cell IDs selected from one raster to extract values from another only if the two rasters have an identical number of cells and extents, as is the case here. You can confirm that this is the case for distsrspecies and precip by running identical(ncell(precip), ncell(distsrspecies)) and ext(precip) == ext(distsrspecies).

In #3 we apply the more correct way of collecting a representative sample from units of varying size, which is to stratify the sample by districts. To do this, we uses the function spatSample again with “method” set to “stratified”, which uses the district IDs in distsrspecies to create the strata, and the “size” argument specifies the size of the sample that should be drawn from each stratum (i.e. district). To estimate that size, we divide nrow(species_env) by the number of species-hosting districts, so that we end up with a total sample equal to 6951. Note that this function merely ensures that the same sample size is drawn from each stratum (district), and does not make the sample size in each stratum proportional to its area.

In #4, we redo the boxplots, showing the results from the two approaches to random sampling alongside the distribution of district mean rainfall values (extracted in Chunk 40). Both approaches produce statistics that are quite representative of the district rainfall. The stratified approach is closest, as the median aligns more closely with that of distrain (the boxplot for “Districts” in the plot).

1.4 Practice

1.4.1 Questions

  1. What are two ways of converting a continuous to categorical raster?

  2. What are two methods of sampling from a spatRaster? Are the others (a look in terras help files will be informative)?

1.4.2 Code

  1. Use app (must use app) to find the difference between the “Max Temperature of Warmest Month” (bio05) and the “Min Temperature of Coldest Month” (bio06) for every cell in bioclimz.

  2. Use raster algebra to find the country-wide average for this calculated Temperature Range, and then identify which areas of Tanzania have a range that is greater than that average.

  3. Create a new categorical raster from precip, using the quantiles to define the new category boundaries together with the cut method applied to classify. Use plot_noaxes to show the result.

  4. Create a new sub-sample of rasterized districts, randdistsr, totalling 15 districts randomly selected using dplyr::sample_n (use a seed of 11) from districts. Use precip as the rasterization target. Mask precip so that it is confined to those districts (with NAs values replacing the values in the other districts). Call it newrandrain.

  5. Use spatSample with a size of 300 to get a random sample of rainfall values from newrandrain (use a seed of 1), saving the result as randsamp. Use spatSample with cell = TRUE to get a sample of 300 / 15 samples from randdistsr. Use the cell IDs to extract the rainfall values from newrandrain, saving the output vector as stratsamp. Create a tibble that binds these two vectors by row (as in Chunk 41 #4), and then plot the two results as side-by-side boxplots with ggplot.

2 Terrain analysis, interpolation, and modeling

We’ll round out this module with some additional raster-based analyses and a bit of modeling. The code below recreates the objects made during prior section needed for this analysis.

2.1 Terrain analysis

We’ll start off this section doing some terrain analysis, with a few little asides on the way to illustrate pixel area and some more plotting skills.

First we need a digital elevation model. The package geodata (note, you will likely have to install this using install.packages("geodata")) provides several very nice functions that allows us to download several different gridded geographic datasets, which includes DEMs:

# Chunk 42
load_quietly("geodata") # load geodata package

dem <- geodata::elevation_30s(country = "TZA", path = tempdir())
plot_noaxes(dem, main = "Tanzania DEM", plg = list(title = "meters"))

The help file for geodata::elevation_30s tells us the following datasets are available:

Elevation data for any country. The main data source is Shuttle Radar Topography Mission (SRTM) , specifically the hole-filled CGIAR-SRTM (90 m resolution) from https://srtm.csi.cgiar.org/. These data are only available for latitudes between -60 and 60.

The 1 km (30 arc seconds) data were aggregated from SRTM 90 m resolution data and supplemented with the GTOP30 data for high latitudes (>60 degrees).

There are two other version (..._3s and ..._global), representing different resolutions and extents, which you can explore further at your leisure. In our example, we choose the coarser 1 km aggregated version of the SRTM DEM, for ease of download, and specify that we want it for Tanzania. geodata has functions for many other datasets also.

2.1.1 Pixel resolution/area

The DEM we downloaded is in decimal degrees with a resolution of 0.008333333°, which means the area of each pixel varies through Tanzania. To illustrate that, we use a nice function that terra provides, which lets you calculate the pixel areas, even when the raster is in geographic coordinates:

# Chunk 43
demarea <- cellSize(dem, unit = "ha")
plot_noaxes(demarea, plg = list(title = "ha"), 
            main = paste0("DEM pixel area, mean = ", 
                         round(global(demarea, mean), 1)))
plot(st_geometry(districts), add = TRUE)

The result is in hectares (as specified in the argument to cellArea), and you can see that the area of each pixel falls as the distance from the equator increases (this is why we use equal area projections to calculate area).

2.1.2 An aside on plot annotation

Although we covered graphics in the previous unit, let’s look a bit more at plot adornments for a second. Notice in the last two plots how we use the “plg” argument in plot_noaxes to put a title over the legend bar.

There are some other things we can do to plots to add symbols to text within the plots: expression and several other functions can be used to annotate plots with more complex mathematical formulae:

# Chunk 44
set.seed(1)
a <- sapply(20:100, function(x) rnorm(n = 100, mean = x, sd = x / 10))
mu <- colMeans(a)  # means
mumu <- round(mean(mu), 2) # mean of mean
sdev <- apply(a, 2, sd)  # stdevs
musd <- round(mean(sdev), 2)  # mean of stdevs
plot(mu, sdev, xlab = expression(mu), ylab = expression(sigma), pch = 20, 
     main = expression(paste("Mean (", mu, ") versus StDev (", sigma, ")")))
text(x = 20, y = 10, pos = 4,
     label = substitute(paste("Overall ", mu, "=", a), list(a = mumu)))
text(x = 20, y = 9, pos = 4,
     label = substitute(paste("Overall ", sigma, "=", a), list(a = musd)))

The example above shows how more advanced plot annotations can be done. It also shows that the code can get somewhat complicated; it always takes a ton of time and a number of internet searches to figure out how to do this. We won’t explain this in detail here, but the code provides a template that you can hack at if you want to get fancy with labelling your plots.

2.1.3 Basic terrain variables

terra::terrain allows us to calculate five different terrain variables, four of which we show below:

# Chunk 45
# #1
tzar <- rast(x = ext(districts), crs = crs(districts), res = res(dem))
values(tzar) <- 1:ncell(tzar)
tzar_alb <- project(x = tzar, y = crs(roads), res = 1000, method = "near")
demalb <- project(x = dem, y = tzar_alb)  # default is bilinear

# #2
vars <- c("slope", "aspect", "flowdir", "TRI")
terrvars <-  lapply(1:length(vars), function(x) {
  tv <- terrain(x = demalb, v = vars[x], unit = "degrees")
}) %>% do.call(c, .)
names(terrvars) <- vars

plot_noaxes(terrvars)

We place the DEM in a projected coordinate system for terrain, although they could be in a planar system, provided elevation is given in m, per the help file: > Compute terrain characteristics from elevation data. The elevation values should be in the same units as the map units (typically meter) for projected (planar) raster data. They should be in meter when the coordinate reference system is longitude/latitude.

So we create (#1) a 1 km projected version of the DEM, allowing the default bilinear interpolation to do its work during reprojection.

In #2 we calculate slope, aspect, flow direction, and the topographic ruggedness index. The commented out code illustrates how each of these variables can be calculated separately, although we prefer to do the job using lapply. Note that in the lapply we pass the “unit” argument into the calculation of all four variables, even though it only is used for slope and aspect; this is fine, because terrain simply ignores the argument value in calculating flow direction or TRI.

2.2 Interpolation

Interpolating between point values is a fairly common GIS analysis. The interpolate function of raster allows you to do this using a number of different types of models, such as kriging and inverse distance weighting. We are going to demonstrate both approaches, drawing on functions provided by the gstat package (please install it):

# Chunk 46
load_quietly("gstat") # load gstat package

# #1
raintotalb <- project(x = precip, y = crs(roads), res = 5000)
names(raintotalb) <- "rain"
r <- rast(ext(raintotalb), res = res(raintotalb), 
          crs = crs(raintotalb), vals = 1)

# #2
set.seed(1)
rainsamp <- spatSample(raintotalb, size = 1000, xy = TRUE, na.rm = TRUE)
# head(rainsamp)

# #3
invdist <- gstat(id = "rain", formula = rain ~ 1, locations = ~x + y, 
                 data = rainsamp)
invdistr <- interpolate(object = r, model = invdist, )
#> [inverse distance weighted interpolation]
#> [inverse distance weighted interpolation]
invdistrmsk <- mask(x = invdistr[[1]], mask = raintotalb)

# #4
v <- variogram(object = rain ~ 1, locations = ~x+y, data = rainsamp)  # a
m <- fit.variogram(object = v, model = vgm("Sph"))  # b
m
#>   model    psill    range
#> 1   Nug 25796.39      0.0
#> 2   Sph 97898.75 306389.9

plot(variogramLine(m, max(v[, 2])), type = "l")  # c
points(v[, 2:3], pch = 20)  # d
legend("bottomright", legend = c("variogram fit", "variogram"), 
       lty = c(1, NA), pch = c(NA, 20), bty = "n") # e

# #5
ordkrig <- gstat(id = "rain", formula = rain ~ 1, data = rainsamp, 
                 locations = ~x+y, model = m)
ordkrigr <- interpolate(object = r, model = ordkrig)
#> [using ordinary kriging]
#> [using ordinary kriging]
ordkrigrmsk <- mask(x = ordkrigr[[1]], mask = raintotalb)

raininterp <- c(list(raintotalb, invdistrmsk, ordkrigrmsk))
titles <- c("Actual rain", "IDW rain", "Kriged rain")
par(mfrow = c(2, 2), mar = c(0, 0, 1, 0))
plot(st_as_sf(rainsamp, coords = c("x", "y"))$geometry, pch = 20, cex = 0.5)
for(i in 1:3) plot_noaxes(raininterp[[i]], main = titles[i])

This example is reasonably complex, both conceptually and with respect to code. We will refer you to other, non-R literature to explain how interpolation and kriging work, and confine our explanations to the code.

Block #1 creates an Albers projected version of precip (which is roughly 5 km on a side), and sets up a dummy raster (r) that will be our interpolation target, i.e. it provides the spatial parameters we want our prediction surface to have. #2 then draws a random sample from across the reprojected rainfall raster, providing a set of point observations of rainfall. #3 runs the code for an inverse distance weighted (IDW) interpolation (each cell between the sample points is given a new value that is the weighted mean of the rainfall values collected at the points nearest to it; the weight is determined by the distance to each point). The interpolation model is set up by the gstat function, using the coordinates x and y as the locations. terra::interpolate takes the model and applies it to r, making the new invdistr raster, which we then mask to the outline of Tanzania (invdistrmsk), selecting just the first layer which provides the predicted value.

In #4, we provide the initial steps for implementing ordinary kriging. Line a calculates the variogram, which measures the spatial autocorrelation between rainfall values. Line b fits a spherical variogram model. Lines c-e plot the variogram (changes in semivariances with distance) and the spherical model we fit to estimate that semivariance.

Block #5 takes the fitted variogram model (m) and uses that to create the rainfall surface. The steps in this block are basically the same as in #3, except that we use the kriging model.

2.3 Distance

If you want to calculate how far any point on a grid is from a particular point or points, terra offers a few functions for finding euclidean distances:

# Chunk 47
# #1
set.seed(1)
randsamp <- spatSample(raintotalb, size = 10, xy = TRUE, na.rm = TRUE) %>% 
  st_as_sf(coords = c("x", "y"), crs = st_crs(roads))

# #2
ptdistr <- distance(x = raintotalb, y = randsamp)
ptdistrmsk <- mask(ptdistr, raintotalb)

# #3
randsampr <- rasterize(randsamp, y = raintotalb)
ptdistr2 <- distance(randsampr)
ptdistrmsk2 <- mask(ptdistr2, raintotalb)

s <- c(ptdistrmsk, ptdistrmsk2) / 1000
names(s) <- c("distanceFromPoints", "distance")
par(mfrow = c(1, 2), oma = c(0, 0, 0, 1))
for(i in 1:nlyr(s)) {
  plot_noaxes(s[[i]], main = names(s)[i])
  plot(st_geometry(randsamp), pch = 20, cex = 0.5, add = TRUE)
}

In #1, we draw 10 randomly selected points from raintotalb and create an sf objects called randsamp. Here we are interested just in the spatial locations, not the values.

In #2, we use the distance function to find how far any given point in an input raster (raintotalb) is from the xy locations contained in randsamp. raintotalb serves here simply to provide the dimensions of the raster surface from which we want to calculate distance values. We mask the resulting distance surface (ptdistr) to confine the results to Tanzania’s borders.

Block #3 does pretty much the same thing, but in this case we use the distance function applied to the rasterized points (where the only cells having values are the cells underlying the points that were rasterized–the rest are NA) and calculates the distances from every other cell (the NA values) to the nearest rasterized point having a non-NA value.

The plot shows that the two approaches produce identical results.

Beyond this, there are more sophisticated distance analyses that can be done, such as cost distance analysis, which are provided by the package gdistance.

2.4 Model prediction

Although there are many, many more analyses and capabilities we could illustrate with rasters, we will end this section with a relatively simple modeling example. In this, we will use random forest to predict species distribution.

First, we will subset our species data for the African buffalo (Syncerus caffer) and convert the coordinates into a sf object. We then select three specific variables from our bioclimz dataset: Annual Mean Temperature (bio01), Mean Diurnal Range (bio02), and Annual Precipitation (bio12), to serve as our environmental predictors. Because we only have “presence” data (locations where the buffalo were seen), we must generate random “background” points across the extent of Tanzania to represent pseudo-absences. Finally, we extract the climatic values for both the sightings and the background points using terra::extract, combining them into a single data frame designed for training our model, where presence is coded as 1 and background as 0.

# Chunk 48
# #1. Subset occurrence data for Syncerus caffer
species_pts <- species %>% filter(species == "Syncerus caffer") %>% 
    st_as_sf(coords = c("x", "y"), crs = 4326)

# #2. Select our 3 relevant bioclim predictors
predictors <- bioclimz[[c("bio01", "bio02", "bio12")]]

# #3. Generate 1000 random background (pseudo-absence) points
set.seed(42) # For reproducibility
bg_pts <- spatSample(predictors, size = 1000, method = "random", 
                     as.points = TRUE, na.rm = TRUE)

# #4. Extract environmental values at presence and background locations
pres_env <- terra::extract(predictors, species_pts, ID = FALSE)
bg_env <- terra::extract(predictors, bg_pts, ID = FALSE)

# #5. Combine into a training data frame
# 1 = Presence, 0 = Background
sdm_data <- data.frame(
  pa = c(rep(1, nrow(pres_env)), rep(0, nrow(bg_env))),
  rbind(pres_env, bg_env)
) %>% na.omit()

head(sdm_data)
#>   pa bio01 bio02 bio12
#> 1  1 23.65   9.6   666
#> 2  1 13.25  10.0  2296
#> 3  1 14.35   9.5  1904
#> 4  1 20.75  11.1   839
#> 5  1 17.65   9.6  1517
#> 6  1 21.85  10.9  1624

After doing that, we can do our species distribution modeling:

# Chunk 49
# #6. Fit the Random Forest Model
load_quietly("randomForest") # load randomForest package

mod_rf <- randomForest(as.factor(pa) ~ bio01 + bio02 + bio12, data = sdm_data,
                       importance = TRUE)

mod_rf
#> 
#> Call:
#>  randomForest(formula = as.factor(pa) ~ bio01 + bio02 + bio12,      data = sdm_data, importance = TRUE) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 1
#> 
#>         OOB estimate of  error rate: 12.44%
#> Confusion matrix:
#>     0    1 class.error
#> 0 829  171  0.17100000
#> 1 125 1255  0.09057971

# #7. Predict across all of Tanzania to create a suitability map
suitability_map <- predict(predictors, mod_rf, type = "prob", index = 2)

# #8. Plot the result
plot_noaxes(suitability_map, main = "Predicted Species Suitability (Syncerus caffer)")
points(species_pts, pch = 16, cex = 0.5, col = "red")

To wrap up, we can calculate the AUC (Area Under the Curve) to see how well our model actually performs:

# #9 Accuracy assessment
# 1. Predict values for the training data points to check accuracy
# 'type = "prob"' gives us the probability (0 to 1)
test_preds <- predict(mod_rf, sdm_data, type = "prob")[, 2]

# 2. AUC (Area Under the Curve)
# AUC = 0.5 is random; AUC > 0.8 is generally considered a good model
load_quietly("pROC") # load pROC package

roc_obj <- roc(sdm_data$pa, test_preds)
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
auc_val <- auc(roc_obj)

auc_val
#> Area under the curve: 0.9995
plot(roc_obj, main = "Variable Importance for Species Model")


# #10. Visualize Variable Importance
# This shows which of our 3 bioclim vars was most influential
varImpPlot(mod_rf, main = "Variable Importance for Species Model")

This example demonstrates a very crude species distribution model (SDM) that uses bio01 (annual mean temperature), bio02 (mean diurnal range), and bio12 (annual precipitation) to predict the habitat suitability of African buffalo in Tanzania. In actual ecological work, the modeling process would be significantly more complex to ensure the results are biologically meaningful. Professional workflows typically involve rigorous steps such as accounting for sampling bias, testing for multicollinearity among predictors, and using independent datasets for validation to prevent the overfitting issues mentioned above. Here, we are simply demonstrating the core spatial mechanics of the SDM workflow.

Let’s examine the code a bit:

Block #1 subsets the species data for our target species and converts the coordinates into a spatial sf object. Importantly, we ensure the coordinate reference system (CRS) matches our environmental layers.

Block #2 selects our three relevant bioclimatic predictors from the bioclimz stack. In #3, we generate 1,000 random “background” points (pseudo-absences) across the extent of the predictors. This is necessary because we only have presence records, and the model needs to compare these against the available environmental conditions across Tanzania.

In #4, we use terra::extract to pull the values of our three predictors for both the known sightings and the random background points. We then combine these into a single training data frame, using #5 to create a response variable pa where presences are coded as 1 and background points as 0.

Block #6 (in your next chunk) uses the randomForest function to fit the model. We treat the response as a factor to perform classification, using “bio01, bio02, and bio12” to predict the probability of buffalo occurrence.

Block #7 uses the generic predict function (this one designed to work with SpatRasters) to apply the Random Forest model coefficients to the entire spatial stack of predictors. This creates the suitability map, where each cell represents the probability of habitat suitability for African buffalo. Note that, the variable names in the predictors stack must match the names used in the model training for the function to succeed. We plot the suitability map and the variable importance in #8. The suitability map shows where the climate matches the buffalo’s niche.

Finally, in block #9 and #10, we assessed the model performance. Our model achieved an AUC of 0.9994877, which indicates an exceptionally high ability to distinguish between presence and background locations. But an AUC of 0.9994877 is also a strong signal that we must be careful of overfitting in actual ecological work.

The variable importance plot summarizes two different measures of how much each bioclimatic factor contributes to the model: the Mean Decrease in Accuracy and the Mean Decrease in Gini (node impurity). Accuracy decreases if a variable is shuffled and the model loses its predictive power, whereas Gini impurity declines when a variable is highly effective at splitting the “presence” and “background” points into clean, distinct groups. Looking at the importance plot in #10, we see that bio02 (Mean Diurnal Range) has the highest Mean Decrease in Accuracy, while bio12 (Annual Precipitation) has the highest Mean Decrease in Gini. This suggests that while precipitation is very efficient at creating the initial broad splits in our decision trees, the daily temperature fluctuations are ultimately the most critical for the model’s overall predictive precision.

So that’s about it for this module.

2.5 Practice

2.5.1 Questions

  1. How do we estimate the pixel area of rasters when they are in non-projected coordinate systems?

  2. How can you add mathematical symbols and other special characters (e.g. superscripts) to graphics?

  3. When you want to apply a fitted model to a multi-layer raster (e.g. pred_stack that provide the model predictor layers, what do we have to remember to do with names(pred_stack)?

2.5.2 Code

  1. Crop demalb down to the extent of district 42 (bear in mind that districts needs to be in the same projection as demalb, so transform it first). Then calculate slope, aspect, and TPI on the cropped DEM. Plot the results so that they appear in 3 side by side panels.

  2. Using the code in Chunk 46 blocks #1-#3 as your template, re-create IDW interpolations using the original 1000 randomly sampled points. Then create two new ones based on 1) 500 and 2) 250 randomly selected points (use the same seed). Stacking the results in this order: raintotalb, the 1000 point IDW, the 500 point IDW, and the 250 IDW. Plot these in a single call to plot_noaxes, setting zlim = c(0, 150) and passing informative titles to the “main” argument (e.g. “bioclim”, “1000 pt IDW”, “500 pt IDW”, etc). There are two ways of doing this: i) the hard way, by copying, pasting, and changing code blocks #2 and #3 for each version of the IDW; ii) the programmatic way, by creating the three IDWs in an lapply. Note: if you choose the elegant way, to make the stack, you will either need to c(raintotalb, idw_list) or c(raintotalb, c(idw_list)) stack the list of three idws, and then stack that with raintotalb.

  3. Select (filter) districts 15, 20, 25, 30, 35, 40, 45, and 50 out of districts. Convert the resulting subset of districts to Albers projection, and then extract the centroids of those districts. Using raintotalb as your target, use distance to calculate the distances from any point in Tanzania to the centroids of those selected districts. Mask the result (you can use raintotalb again for the mask target) and plot it using plot_noaxes.

  4. Redo the code in Chunk 48, but use a new random draw of just 100 background to get bg_pts (ref. #3). How much more would the results change compared to a model with denser background?

3 Unit assignment

3.1 Set-up

Make sure you are working in the main branch of your project (you should not be working on the a4 branch). Create a new vignette named “unit2-module2.Rmd”. You will use this to document the tasks you undertake for this assignment.

There are no package functions to create for this assignment. All work is to be done in the vignette.

3.2 Vignette tasks

  1. Create a subset of districts by extracting districts 22, 26, 53, and 54. Call it districts_ss. Use the extent of districts_ss (ext(districts_ss)) to define the extent of a new raster r, which should have a resolution of 0.1°. Use r as a template for creating two new rasters, rsamp and rrandn. rsamp should be filled with randomly selected integers ranging between 10 and 50. rrandn should be filled with random numbers drawn from a normal distribution (rnorm) that has a mean of 30 and standard deviation of 5. Use a seed of 1 in set.seed. Stack rsamp and rrandn (name the stack s), mask that by districts_ss, and plot s using plot_noaxes. (Ref: Chunks 1, 3, 4, 16)

  2. Disaggregate s[[1]] to a resolution of 0.025°, using bilinear interpolation, calling the result s2_1d. Select all areas of s2_1d that have values > 35, creating a new raster s2_1gt35. Set the values of s2_1gt35 that equal 0 to NA. Then convert the resulting raster into an sf object called s2poly. Plot the resulting polygons over s2_1d. (Ref: Chunks 10, 22, 37)

  3. Create a new grid from the extent of districts that has a resolution of 0.5° (call it tzar), assigning all cells a value of 1. Then recreate the speciesr dataset–a raster that sums the number of species falling within each grid cell. Mask the results using districts, and then plot the result onto a grey background of Tanzania. (Ref: Chunk 8, 37)

  4. Convert the rasterized species counts (speciesr) back into an sf points object speciesrpts. Create a new version of tzar at 0.05°, and then calculate the distance between these points and every other location in Tanzania, creating an output grid of distances, called dist_to_species, which you mask by districts. Plot dist_to_species in kilometers (i.e. divide it by 1000) using plot_no_axes, with speciesrpts overlaid as black solid circles. (Ref: Chunks 8, 10, 47)

  5. Use geodata’s worldclim_country function to grab WorldClim’s mean temperature (“tavg”) dataset at a resolution of 2.5 (note this is not degrees, but minutes of a degree), and download it to somewhere on your local disk. That will give a SpatRaster with 12 layers, with each layer representing the average monthly temperature for each grid cell on the planet. Calculate the annual mean temperature for each cell, and then mask the result using districts to get your final raster, tzatmean. Plot the result. (Ref: Chunk 17, 48)

  6. Classify the temperature data into three categories, low, medium, and high, using <20°, 20-24°, and >24° as the break points for determining the classes. Use the reclassify function with a reclassification matrix, which you should do like this:

    trng <- global(tzatmean, range, na.rm = TRUE)
    reclmat <- cbind(c(floor(trng[1]), 20, 24), c(20, 24, ceiling(trng[2])), 1:3)

    Here global is helping to find the values of tmin and tmax, which respectively define the lower bound of the “low” class and the upper bound of the “high” class. What are the functions floor and ceiling doing (answer this in your vignette)? Call the reclassified temperature raster tzatclass. Make the map using plot_noaxes with a categorical legend, and using the colors “blue”, “yellow2”, and “red” for the three classes. (Ref: Chunk 26, 39)

  7. Recreate the precip layer (precip <- bioclimz[["bio12"]]), then calculate the mean precipitation within each temperature zone defined by tzatclass (hint: you will need to resample precip using function resample). Call the resulting matrix z. Map the mean zonal precipitation values in z onto each temperature zone (using the subst function with tzatclass as the target; remember that zonal returns a matrix, and that subst requires equal length vector for the “from” and “to” values. Call the new raster tzaprecz, and then plot it using plot_noaxes, with a custom legend (as done in Task 6), using the rounded zonal mean values (rounded) as the legend labels (legend = round(z$mean)). Use colors “yellow2”, “green3”, and “blue” for the three classes (Ref: Chunks 32, 33, 39)

  8. Use geodata::elevation_30s again to download the elevation raster for Tanzania (call it dem). Aggregate it to the same resolution as bioclim, using the default mean aggregation, and mask it using districts. Call that dem5. Use terrain to calculate aspect from dem5 (call it aspect), selecting degrees as the output value. Then find all west-facing aspects (aspects >247.5 and <292.5), and all east facing aspects (>67.5 and <112.5), making new rasters respectively named west and east, e.g. west <- aspect > 247.5 & aspect < 292.5). Stack these together with aspect and make a three-panel plot with plot_noaxes with titles “Aspect”, “West”, and “East”. (Ref: Chunks 37, 42)

  9. Using a random seed of 1, create two random samples of 100 each. The first one should be collected from within the west-facing cells (i.e. only be drawn from cells in west that have a cell of one), and the second from east-facing cells. To do this, set the cells equal to 0 in east and west to NA (e.g. west[west == 0] <- NA). Once you have collected those, convert the resulting objects to sf, and use those two sets of points to extract temperature values from tzatmean into a tibble temp_stats, which is going to look this:

    temp_stats <- bind_rows(
      tibble(temp = terra::extract(tzatmean, westpts)$mean, dat = "West"), 
      tibble(temp = terra::extract(tzatmean, eastpts)$mean, dat = "East")
    )

    Then use temp_stats with ggplot to make side-by-side boxplots to compare the distributions of west and east facing temperatures, modeled on the example in Chunk 40 #4. (Ref: Chunks 37, 40)

  10. Extra credit worth (5 points). Extract the centroids from each district in districts (call it dcent), and reproject the points to Albers, using the st_crs(roads). Reproject tzatmean to Albers also, making the new resolution (5 km, i.e. 5000 m), using bilinear interpolation (call it tzatmeanalb). Then use dcent to extract the temperature values from tzatmeanalb (add the values to dcent as a new variable “temp” using mutate). Use gstat to create an IDW model (call it idw). To make the IDW work, which isn’t sf compliant, some extra work will be required, as shown below (this is the step needed after the extract of temperature values)

    dcent <- bind_cols(
      dcent %>% data.frame %>% dplyr::select(-geometry) %>% as_tibble, 
      st_coordinates(dcent) %>% as_tibble
    ) %>% rename(x = X, y = Y)

    This yields a tibble with columns x and y that are needed by gstat. After running gstat, map the interpolated temperatures using tzatmeanalb as a target object (it won’t be overwritten) and idw as the model. Make sure you mask the result to the boundaries of Tanzania, using tzatmeanalb as the mask. Call the new interpolated, masked grid tzatidw. Plot the result side by side with tzatmeanalb for comparison using plot_noaxes with titles “Real Temperature” and “IDW Temperature”. (Refs: Chunks 46, 49)

3.3 Assignment output

Refer to Unit 1 Module 4’s assignment output section for submission instructions. The only differences are as follows:

  1. Your submission should be on a new side branch “a5”;
  2. You should increment your package version number by 1 on the lowest number (e.g. from 0.0.2 to 0.0.3) in the DESCRIPTION;
  3. Do not worry about #3 in those instructions

Back to home